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Abstract. We investigate the properties of the Hybrid Monte-Carlo algo- 
rithm (HMC) in high dimensions. HMC develops a Markov chain reversible 
w.r.t. a given target distribution n by using separable Hamiltonian dynamics 
with potential — log EL The additional momentum variables are chosen at ran- 
dom from the Boltzmann distribution and the continuous-time Hamiltonian 
dynamics are then discretised using the leapfrog scheme. The induced bias is 
removed via a Metropolis-Hastings accept/reject rule. In the simplified sce- 
nario of independent, identically distributed components, we prove that, to 
obtain an 0(1) acceptance probability as the dimension d of the state space 
tends to oo, the leapfrog step-size h should be scaled as h = I X d~ 1 / 4 . There- 
fore, in high dimensions, HMC requires C(d 1//4 ) steps to traverse the state 
space. We also identify analytically the asymptotically optimal acceptance 
probability, which turns out to be 0.651 (to three decimal places). This is 
the choice which optimally balances the cost of generating a proposal, which 
decreases as / increases, against the cost related to the average number of 
proposals required to obtain acceptance, which increases as I increases. 



1. Introduction 

The Hybrid Monte Carlo (HMC) algorithm originates from the physics literature 
[8] where it was introduced as a fast method for simulating molecular dynamics. 
It has since become popular in a number of application areas including statistical 
physics [10, 11, 28, 16, 1], computational chemistry [15, 19, 27, 30], data assimilation 
[2], geophysics [19] and neural networks [21, 31]. The algorithm has also been 
proposed as a generic tool for Bayesian statistical inference [20, 6, 9]. 

HMC has been proposed as a method to improve on traditional Markov Chain 
Monte Carlo (MCMC) algorithms. There are heuristic arguments to suggest why 
HMC might perform better, for example based on the idea that it breaks down 
random walk-like behaviour intrinsic to many MCMC algorithms such as Random- 
Walk Metropolis (RWM) algorithm. However there is very little theoretical under- 
standing of this phenomenon (though see [7]). This lack of theoretical guidance 
of choosing the free parameters for the algorithm partly accounts for its relative 
obscurity in statistical applications. The aim of this paper is to provide insight into 
the behavior of HMC in high dimensions and develop theoretical tools for improving 
the efficiency of the algorithm. 

HMC uses the derivative of the target probability log-density to guide the Monte- 
Carlo trajectory towards areas of high probability. The standard RWM algorithm 
[18] proposes local, symmetric moves around the current position. In many cases 
(especially in high dimensions) the variance of the proposal must be small for the 
corresponding acceptance probability to be satisfactory. However smaller proposal 
variance leads to higher autocorrelations, and large computing time to explore the 
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state space. In contrast, and as discussed in the following sections, HMC exploits 
the information on the derivative of the log density to deliver guided, global moves, 
with higher acceptance probability. 

HMC is closely related to the so-called Metropolis- adjusted Langcvin algorithm 
(MALA) [25] which uses the derivative of the log-density to propose steepest-ascent 
moves in the state space. MALA employs Langevin dynamics; the proposal is 
derived from an Euler discretisation of a Langevin stochastic differential equation 
that leaves the target density invariant. On the other hand, HMC uses Hamiltonian 
dynamics. The original variable q is seen as a 'location' variable and an auxiliary 
'momentum' variable p is introduced; Hamilton's ordinary differential equations are 
used to generate moves in the enlarged (q,p) phase space. These moves preserve 
the total energy, a fact that implies, in probability terms, that they preserve the 
target density n of the original q variable, provided that the initial momentum is 
chosen randomly from an appropriate Gaussian distribution. Although seemingly 
of different origin, MALA can be thought of as a 'localised' version of HMC: we 
will return to this point in the main text. 

In practice, continuous-time Hamiltonian dynamics are discretised by means of 
a numerical scheme; the popular Stormer-Verlet or leapfrog scheme [12, 17, 26, 29] 
is currently the scheme of choice. This integrator does not conserve energy exactly 
and the induced bias is corrected via a Metropolis-Hastings accept/reject rule. 
In this way, HMC develops a Markov chain reversible w.r.t. n, whose transitions 
incorporate information on n in a natural way. 

In this paper we will investigate the properties of HMC in high dimensions and, 
in such a context, offer some guidance over the optimal specification of the free 
parameters of the algorithm. We assume that we wish to sample from a density n 
on K w with 

(1.1) n(Q) = cxp(-V(Q)) , 

for V : M. N —> BL We study the simplified scenario where n(Q) consists of d ^> 1 
independent identically distributed (iid) vector components, 

d 

(1.2) n(Q) =exp(-^V(g,)) , V : R m -> K ; N = m x d . 

»=i 

For the leapfrog integrator, we show analytically that, under suitable hypotheses 
on V and as d — > oo, HMC requires 0(d 1 ^ 4 ) steps to traverse the state space, and 
furthermore, identify the associated optimal acceptance probability. 

To be more precise, if h is the step-size employed in the leapfrog integrator, then 
we show that the choice 

(1.3) HMC: h = l-d~ 1/4 

leads to an average acceptance probability which is of 0(1) as d — > oo : Theorem 
3.6. This implies that 0(d 1 ^ 4 ) steps are required for HMC to make 0(1) moves in 
state space. Furthermore we provide a result of perhaps greater practical relevance. 
We prove that, for the leapfrog integrator and as d — > oo, the asymptotically optimal 
algorithm corresponds to a well-defined value of the acceptance probability, inde- 
pendent of the particular target II in (1.2). This value is (to three decimal places) 
0.651: Theorems 4.1 and 4.2. Thus, when applying HMC in high dimensions, one 
should try to tunc the free algorithmic parameters to obtain an acceptance proba- 
bility close to that value. We give the precise definition of optimality when stating 
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the theorems but, roughly, it is determined by the choice of I which balances the 
cost of generating a proposal, which decreases as / increases, against the cost related 
to the average number of proposals required to obtain acceptance, which increases 
as I increases. 

The scaling 0(d^ 4 ) to make 0(1) moves in state space contrasts favorably with 
the corresponding scalings 0(d) and 0(d 1 ^ 3 ) required in a similar context by RWM 
and MALA respectively (see the discussion below). Furthermore, the full analysis 
provided in this paper for the leapfrog scheme may be easily extended to high-order, 
volume-preserving, reversible integrators. For such an integrator the corresponding 
scaling would be 0(d 1 ^ 2u ^), where v (an even integer) represents the order of the 
method. For the standard HMC algorithm, previous works have already established 
the relevance of the choice h = 0(d~ 1 ^ 4 ) (by heuristic arguments, see [11]) and an 
optimal acceptance probability of around 0.7 (by numerical experiments, see [6]). 
Our analytic study of the scaling issues in HMC was prompted by these two papers. 

The paper is organized as follows. Section 2 presents the HMC method and re- 
views the literature concerning scaling issues for the RWM and MALA algorithms. 
Section 3 studies the asymptotic behaviour of HMC as the dimensionality grows, 
d — ^ oo, including the key Theorem 3.6. The optimal tuning of HMC is discussed in 
Section 4, including the key Theorems 4.1 and 4.2. Sections 5 and 6 are technical. 
The first of them contains the derivation of the required numerical analysis esti- 
mates on the leapfrog integrator, with careful attention paid to the dependence of 
constants in error estimates on the initial condition; estimates of this kind are not 
available in the literature and may be of independent interest. Section 6 gathers 
the probabilistic proofs. We finish with some conclusions and discussion in Section 
7. 



2. Hybrid Monte Carlo (HMC) 
2.1. Hamiltonian dynamics. Consider the Hamiltonian function: 

H(Q,P) = \{P,M- 1 P)+V{Q) , 

on M. 2N , where M. is a symmetric positive definite matrix (the 'mass' matrix). One 
should think of Q as the location argument and V(Q) as the potential energy of 
the system; P as the momenta, and (1/2) (P, A4~ 1 P) as the kinetic energy. Thus 
H(Q,P) gives the total energy: the sum of the potential and the kinetic energy. 
The Hamiltonian dynamics associated with ft are governed by 

(2.1) ^F =M ~ lp > ^- = - VV ^' 

a system of ordinary differential equations whose solution flow $ t defined by 

(Q(t),P(t)) = $ t (Q(0),P(0)) 
possesses some key properties relevant to HMC: 



• 1. Conservation of Energy: The change in the potential becomes ki- 
netic energy; i.e., H o $ t = H, for all t > 0, or H(<Z>t(Q(0), P(0))) = 
H(Q(0),P(0)), for all * > and all initial conditions (Q(0),F(0)). 
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• 2. Conservation of Volume: The volume element dP dQ of the phase 
space is conserved under the mapping $t. 

• 3. Time Reversibility: If S denotes the symmetry operator: 

S(Q,P) = (Q,-P) 

then Ho S = % and 

(2.2) So($ t )" 1 o5 = $ i . 

Thus, changing the sign of the initial velocity, evolving backwards in time, 
and changing the sign of the final velocity reproduces the forward evolution. 
From the Liouville equation for equation (2.1) it follows that, if the initial con- 
ditions are distributed according a probability measure with Lebesgue density de- 
pending only on H(Q,P), then this probability measure is preserved by the Hamil- 
tonian flow 4> t . In particular, if the initial conditions (Q(0),P(0)) of (2.1) are 
distributed with a density (proportional to) 

exp(-«(Q,P)) = exp((l/2)(P,X- 1 P))cxp(-V(g)), 

then, for all t > 0, the marginal density of Q(t) will also be (proportional to) 
exp(— V(Q)). This suggests that integration of equations (2.1) might form the 
basis for an exploration of the target density exp(— V(Q)). 

2.2. The HMC algorithm. To formulate a practical algorithm, the continuous- 
time dynamics (2.1) must be discretised. The most popular explicit method is 
the Stormer-Verlet or leapfrog scheme (see [12, 17, 26] and the references therein) 
defined as follows. Assume a current state (Qq,Pq); then, after one step of length 
h > the system (2.1) will be at a state (Qh,Ph) defined by the three-stage 
procedure: 

(2.3a) P h ,2 = Pa-'i VV(Qo) ; 

(2.3b) Qh = Qo + hM- 1 P h/2 ; 

(2.3c) P h = P h/2 - § VV(Qh) . 

The scheme gives rise to a map: 

*h = (Qo,Po) (Qh,Ph) 

which approximates the flow The solution at time T is approximated by taking 
L^J leapfrog steps: 

(Q(T),P(T)) = $ T ((Q(0),P(0)) » *L* J ((Q(0),P(0)) . 
Note that this is a deterministic computation. The map 

may be shown to be volume preserving and time reversible (see [12, 17, 26]) but it 
does not exactly conserve energy. As a consequence the leapfrog algorithm does not 
share the property of equations (2.1) following from the Liouville equation, namely 
that any probability density function proportional to exp(— H(Q, P)) is preserved. 
In order to restore this property an accept-reject step must be added. Paper [20] 
provides a clear derivation of the required acceptance ceritcrion. 
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We can now describe the complete HMC algorithm. Let the current state be Q. 
The next state for the HMC Markov chain is determined by the dynamics described 
in Table 1. 



HMC(Q): 

(i) Sample a momentum P ~ N(0,M). 

(ii) Accept the proposed update Q defined via (Q',P') = ^ h (Q,P) w.p.: 

a((Q, P), (Q', P')) := 1 A expfH(Q, P) - H(Q', P')} . 

Table 1. The Markov transition for the Hybrid Monte-Carlo al- 
gorithm. Iterative application for a given starting location Q°, will 
yield a Markov chain Q°, Q 1 , . . . 



Due to the time reversibility and volume conservation properties of the integrator 
map 4 , j l T \ the recipe in Table 1 defines (see [8, 20]) a Markov chain reversible w.r.t 
n(Q); sampling this chain up to equilibrium will provide correlated samples Q n 
from n(Q). We note that the momentum P is merely an auxiliary variable and 
that the user of the algorithm is free to choose h, T and the mass matrix Ai. In 
this paper we concetrate on the optimal choice of h, for high dimensional targets. 

2.3. Connection with other Metropolis-Hastings algorithms. Earlier re- 
search has studied the optimal tuning of other Metropolis-Hastings algorithms, 
namely the Random- Walk Metropolis (RWM) and the Metropolis-adjusted Langc- 
vin algorithm (MALA). In contrast with HMC, whose proposals involve a deter- 
ministic element, those algorithms use updates that are purely stochastic. For the 
target density n(Q) in (1.1), RWM is specified through the proposed update 

Q' = Q + VhZ , 

with Z ~ N(0, 1) (this sample case suffices for our exposition, but note that Z may 
be allowed to have an arbitrary mean zero distribution), while MALA is determined 
through the proposal 

Q' = Q + - viogn(Q) + Vhz . 

The density n is invariant for both algorithms when the proposals are accepted 
with probability 

Tl{Q')T(Q\Q) 

«(Q,Q) = iA n(Q)mQ/) , 

where 

T(x,y)=P[Q' edy\Q = x]/dy 
is the transition density of the proposed update (note that for RWM the symmetry 
of the proposal implies T(Q, Q') = T(Q', Q)). 

The proposal distribution for MALA corresponds to the Euler discretization of 
the stochastic differential equation (SDE) 

dQ = -Vlo g n(Q)dt + dW, 
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for which II is an invariant density (here W denotes a standard Brownian motion) . 
One can easily check that HMC and MALA are connected because HMC reduces 
to MALA when T = h, i.e., when the algorithm makes only a single leapfrog step 
at each transition of the chain. 

Assume now that RWL and MALA are applied with the scalings 

(2.4) RWM : h = I ■ eT 1 , MALA: /i = /■ cT 1 / 3 , 

for some constant I > 0, in the simplified scenario where the target IT has the iid 
structure (1.2) with m = 1. The papers [22], [23] prove that, as d — > oo and under 
regularity conditions on V (the function V must be seven times differentiable 1 , with 
all derivatives having polynomial growth bounds, and all moments of exp(— V) must 
be finite), the acceptance probability approaches a nontrivial value: 

E[o(Q,Q')]->o(0 e (0,1) 

(the limit a(l) is different for each of the two algorithms). Furthermore, if q®, q\, . . . 
denotes the projection of the trajectory Q°, Q 1 , . , . onto its first coordinate, in the 
above scenario it is possible to show ([22], [23]) the convergence of the continuous- 
time interpolation 

(2.5) RWM : t i-> q [ *' d] , MALA : t i-> q^'^'^ 
to the diffusion process governed by the SDE 

(2.6) dq = -^la(l)V'(q)dt+y/la(l)dw, 

(w represents a standard Brownian motion). In view of (2.4), (2.5) and (2.6) we 
deduce that the RWM and MALA algorithms cost 0(d) and 0(d 1/3 ) respectively 
to explore the invariant measure in stationarity. Furthermore, as the product la(l) 
determines the speed of the limiting diffusion the state space will be explored faster 
for the choice l op t of I that maximises I a(l). While l op t depends on the target distri- 
bution, it turns out that the optimal acceptance probability a(l op t) is independent 
of V. In fact, with three decimal places, one finds: 

RWM : a(l opt ) = 0.234, MALA : a{l opt ) = 0.574 . 

Asymptotically as d — > oo, this analysis identifies algorithms that may be regarded 
as uniformly optimal, because, as discussed in [24], ergodic averages of trajecto- 
ries corresponding to I = l op t provide optimal estimation of expectations E [/(<?)], 
q ~ exp(— V), irrespectively of the choice of the (regular) function /. These investi- 
gations of the optimal tuning of RWL and MALA have been subsequently extended 
in [3] and [4] to non-product target distributions. 

For HMC we show that the scaling (1.3) leads to an average acceptance prob- 
ability of 0(1) and hence to a cost of 0(d 1 ^ 4 ) to make the 0(1) moves necessary 
to explore the invariant measure. However, in constrast to RWM and MALA, we 
are not able to provide a simple description of the limiting dynamics of a single 
coordinate of the Markov chain. Consequently optimality is harder to define. 



although this is a technical requirement which may be relaxed 
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3. Hybrid Monte Carlo in the limit d -> oo. 

The primary aim of this section is to prove Theorem 3.6 concerning the scaling 
of the step-size h in HMC. We also provide some insight into the limiting behaviour 
of the resulting Markov chain, under this scaling, in Propositions 3.8 and 3.9. 

3.1. HMC in the iid scenario. We now study the asymptotic behaviour of the 
HMC algorithm in the iid scenario (1.2), when the number d of 'particles' goes to 
infinity. We write Q = {qi)f = i and P = (pi)f=i to distinguish the individual com- 
ponents, and use the following notation for the combination location/momentum: 

X = { Xi )ti\ x i :=(g i ,Pi)GK aTO . 

We denote by V q and V v the projections onto the position and momentum compo- 
nents of x, i.e. V q (q,p) = q, V p {q,p) = p. 
We have: 

d 

H(Q,P) = Y,H( qi , Pt ); H(q,p):=-(p,M- 1 p)+V(q) , 

i=l 

where M is a m x m symmetric, positive definite matrix. The Hamiltonian differ- 
ential equations for a single (m-dimcnsional) particle are then 

<M, § = M- V |-W(„, 

where V : R m — > R. We denote the corresponding flow by ipt and the leapfrog 
solution operator over one /i-step by iph. 

Thus the acceptance probability for the evolution of the d particles is given by 
(see Table 1): 

d 

(3.2) a(X,Y) = 1 A exp (j^fffo) - Htyjpfa))] 

i=l 

with Y = (yi)f = i = ^^\x) denoting the HMC proposal. Note that the leapfrog 
scheme (2.3) is applied independently for each of the d particles (qi,Pi); the different 
co-ordinates are only connected through the accept/reject decision based on (3.2). 

3.2. Energy increments. Our first aim is to estimate (in an analytical sense) the 
exponent in the right-hand side of (3.2). Since the d particles play the same role, 
it is sufficient to study a single term H(xi) — H{w h We set 

(3.3) A(x, h) := H(tP { P {x)) _ H(<p T (x)) = H{^ ] (x)) - H(x) . 

This is the energy change, due to the leapfrog scheme, over < t < T, with step-size 
h and initial condition x, which by conservation of energy under the true dynamics, 
is simply the energy error at time T. We will study the first and second moments: 

H(h) := E [ A(x, h)]= I A(x, h) e~ H{x) dx , 

s 2 {h) :=E[A(x 7 h)\ 2 } , 
and the corresponding variance 

a 2 (h) = s 2 (h)-n 2 (h) . 
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If the integrator were exactly energy-preserving, one would have A = and all 
proposals would be accepted. However it is well known that the size of A(x, h) is in 
general no better than the size of the integration error ipj^\x) — (Pt(x), i.e. 0(h 2 ). 
In fact, under natural smoothness assumptions on V the following condition holds 
(see Section 5 for a proof): 

Condition 3.1. There exist functions a(x), p(x,h) such that 

(3.4) A(x, h) = h 2 a(x) + h 2 p(x, h) 

with lim/j-j-o p{x, h) = 0. 

Furthermore in the proofs of the theorems below we shall use an additional 
condition to control the variation of A as a function of x. This condition will be 
shown in Section 5 to hold under suitable assumptions on the growth of V and its 
derivatives. 

Condition 3.2. There exists a function D : R 2m — > R such that 

\A(x,h)\ 2 
sup 1 \' 4 n < D(x) , 

0<h<l n 

with 

D{x)e- H{x) dx < oo . 

Key to the proof of Theorem 3.6 is the fact that the average energy increment 
scales as 0(h 4 ). We show this in Proposition 3.4 using the following simple lemma 
that holds for general volume preserving, time reversible integrators: 

Lemma 3.3. Let ip h be any volume preserving, time reversible numerical inte- 
grator of the Hamiltonian equations (3.1) and A(x,h) : R 2 " 1 x K + — > K be as in 
(3.3). // ip : R — > M is an odd function then: 

cp(A(x, h)) e~ H{x) dx = - [ ip(A(x, h)) e-^f'^)) dx 

provided at least one of the integrals above exist. If <p is an even function, then: 

ip{A(x, h)) e~ H( - x) dx = [ ip(A(x, h)) e ~ H ^^ dx , 

provided at least one of the integrals above exist. 

Proof. See Section 6. □ 
Applying this lemma with <p(u) = u, we obtain 

fi(h) = - [ A(x,h) e- H( ^ T> {x)) dx , 



which implies that 

(3.5) 2p(h) = [ A(x, h) [1 - exp(-A(a;, h))] e~ H ^ dx 
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We now use first the inequality \e u — 1| < |it|(e" + 1) and then Lemma 3.3 with 
(p(u) — u 2 to conclude that 



\2p{h)\< I \A{x,h)\ 2 e~ H ^ {x »dx + / \A{x,h)\ 2 e- H{x) dx 
(3.6) <2 f \A{x,h)\ 2 e- H(x) dx = 2s 2 (h) . 



The bound in (3.6) is important: it shows that the average of A(x, h) is actually of 
the order of (the average of) A(x, h) 2 . Since for the second-order leapfrog scheme 
A(x, h) = 0(h 2 ), we see from (3.6) that we may expect the average n(h) to actually 
behave as 0(h 4 ). This is made precise in the following theorem. 

Proposition 3.4. If the potential V is such that Conditions 3.1 and 3.2 hold for 

(T) 

the leapfrog integrator ip^ , then 

lim ~~~t~ = M ■ lim — -^-^ = S , 

for the constants: 

a 2 (x)e- H(x) dx ; p = E/2 . 

Proof. See Section 6. □ 

Next, we perform explicit calculations for the harmonic oscillator and verify the 
conclusions of Proposition 3.4. 



Example 3.5 (Harmonic Oscillator). Consider the Hamiltonian 
that gives rise to the system 



H(q,p) = ^p 2 + ^q 2 



fdq/dt\ = fp 

\dp/dt) \-q 

with solutions 

'q{t)\ _ ( cos(t) sin(i)\ f q(0) 



K p(t)J \-sin(t) cos(t)J \p(0) 
In this case, the leapfrog integration can be written as: 

^ = Mq, P ) = (_V+V/4 1 _ h h 2 /2 ) Q = 3 

and, accordingly, the numerical solution after [-^J steps is given by 



4>h (9>P) = 3 

Diagonalizing S and exponentiating yields: 



cos(9n) , 1 sin(6>n) 



- h 2 /4 sin(0n) cos(6>n) J 

where 6 — cos _1 (l — h 2 /2). Using, for instance, MATHEMATICA, one can now 
obtain the Taylor expansion: 

A(x, h) = H{i{ 1] (x)) - H(x) = h 2 a(x) + h 4 /3(x) + 0(h 6 ) 



10 



A. BESKOS, N.S. PILLAI, G.O.ROBERTS, J. M. SANZ-SERNA, AND A.M.STUART 



where: 

a(q,p) = ((p 2 - g 2 )sin 2 (l) +pc?sin(2)) /8 ; 

f3(q,p) = (-g 2 sin(2)+p<?(2cos(2)+3sin(2)) + p 2 (3 - 3 cos(2) + sin(2))) /192 . 

Notice that, in the stationary regime, q, p are standard normal variables. Therefore, 
the expectation of a(x) is 0. Tedious calculations give: 

Var[ a (.x)] = J_sin 2 (l) ; E [f3(x) ] = sin 2 (l) , 
in agreement with Proposition 3.4- 

3.3. Expected acceptance probability. We are now in a position to identify 
the scaling for h that gives non-trivial acceptance probability as d — > oo . 

Theorem 3.6. Assume that the potential V is such that the leapfrog integrator 
iPh satisfies Conditions 3.1 and 3.2 and that 

(3.7) h = I ■ rf- 1/4 , 

for a constant I > 0. Then in stationarity, i.e., for X ~ exp(— %), 

lim E [ a(X, Y)]=2 $H 2 V5/2) =: a(l) 
where the constant S is as defined in Proposition 3.4- 

Proof. To grasp the main idea, note that the acceptance probability (3.2) is given 

by 

d 

(3.8) a(X,Y) = lAe Rd ; R d = A(x h h) . 

i=l 

Due to the simple structure of the target density and stationarity, the terms A (a;,, h) 
being added in (3.8) are iid random variables. Since the expectation and standard 
deviation of A(x, h) are both 0(h 4 ) and we have d terms, the natural scaling to 
obtain a distributional limit is given by (3.7). Then Rd ~ N{— ^/ 4 S, Z 4 S) and the 
desired result follows. See Section 6 for a detailed proof. □ 

In Theorem 3.6 the limit acceptance probability arises from the use of the Central 
Limit Theorem. If Condition 3.2 is not satisfied and cr 2 (h) = oo, then a Gaussian 
limit is not guaranteed and it may be necessary to consider a different scaling to 
obtain a heavy tailed limiting distribution such as a stable law. 

The scaling (3.7) is a direct consequence of the fact that the leapfrog integrator 
possesses second order accuracy. Arguments similar to those used above prove that 
the use of a volume-preserving, symmetric z/-th order integrator would result in a 
scaling h = 0{d~ 1 ^ 2u ' > ) {v is an even integer) to obtain an acceptance probability 
of 0(1). 

3.4. The displacement of one particle in a transition. We now turn our 
attention to the displacement — of a single particle in a transition n — > n+1 
of the chain. Note that clearly 

(3.9) ql +1 =I n -V q ^p{q^,p^) + {l-I n )q^ I n = W o( x» y») . 

While Conditions 3.1 and 3.2 above refer to the error in energy, the proof of the 
next results requires a condition on the leapfrog integration error in the dynamic 
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variables q and p. In Section 5 we describe conditions on V that guarantee the 
fulfillment of this condition. 

Condition 3.7. There exists a function E : R 2m — > R such that 

sup \£!M^iM < E[x) , 



with 



f E{xfer H ^dx < oo 
it 2 ™ 



Under the scaling (3.7) and at stationarity, the second moment E [ (q™ +1 — q™) 2 ] 
will also approach a nontrivial limit: 

Proposition 3.8. Assume that the hypotheses of Theorem 3.6 and Condition 3.7 
hold and, furthermore, that the density exp(— V(q)) possesses finite fourth moments. 
Then, in stationarity, 

lim E [ (<7™ +1 - q™) 2 \=Cj- a(l) 

d—*oo 

where the value of the constant Cj is given by 

C, = E[(v q Mq,p)-q) 2 ] ; (<z,p) ~ exp(-H( q , P )) . 

Proof. See Section 6. □ 

We will use this proposition in Section 4. 

3.5. The limit dynamics. We now discuss the limiting dynamics of the Markov 
chain, under the same assumptions made in Proposition 3.8. For HCM (as for 
RWM or MALA) the marginal process {<z"}ri>o is not Markovian w.r.t. its own 
filtration since its dynamics depend on the current position of all d particles via 
the acceptance probability a(X n ,Y n ) (see (3.9)). In the case of MALA and RWM, 
{Oi}n>o is asymptotically Markovian: as d — > oo the effect of the rest of the particles 
gets averaged to a constant via the Strong Law of Large Numbers. This allows for 
the interpolants of (2.5) to converge to solutions of the SDE (2.6), which defines 
a Markov process. We will now argue that for HCM {q r i} n >o cannot be expected 
to be asymptotically Markovian. In order to simplify the exposition we will not 
present all the technicalities of the argument that follows. 

It is well known (see for instance [29]) that, due to time reversibility and un- 
der suitable smoothness assumptions on V, the energy increments of the leapfrog 
integrator may be expanded in even powers of h as follows (cf. (3.4)): 

A{x, h) = h 2 a{x) + h 4 f3(x) + 0(h 6 ) . 

Here E [ a(x) ] =0 because from Proposition 3.4 we know that E [ A(x, h) ] = C(/i 4 ). 
Ignoring 0(/i 6 )-terms, we can write: 

a(X n ,Y n ) = 1 A e R ".* +R l * 

with 

d d 
Kd = ~h 2 £{a(a;?) - E [ «(*?) | g? ]} - h^ PK) , 

i=l »=1 



i=i 
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Under appropriate conditions, R™ d converges, as d — > oo , to a Gaussian limit inde- 
pendent of the cr-algebra cr(q™, gj, ■ • ■)• To see that, note that, due to the Strong 
Law of Large Numbers and since h 4 = l 4 /d, the second sum in R" d converges 
a.s. to a constant. Conditionally on a(qi,q%, . . .), the distributional limit of the 
first term in R™ d is Gaussian with zero mean and a variance determined by the 

the a.s. limit of h 4 J2i=i{ a ( x ?) ~ ^ [ a ( x r) I 9? ]} 2 ; this follows from the Martin- 
gale Central Limit Theorem (see e.g. Theorem 3.2 of [14]). On the other hand, 
the limit distribution of K% d is Gaussian with zero mean but, in general, cannot 
be asymptotically independent of c(qx,q2, ■ ■ .). In the case of RWM or MALA, 
the conditional expectations that play the role played here by M[a(xf) \qf] are 
identically zero (see the expansions for the acceptance probability in [22] and [23]) 
and this implies that the corresponding acceptance probabilities are asymptoti- 
cally independent from <r(q™, q% , . . .) and that the marginal processes {q™} 
asymptotically Markovian. 

The last result in this section provides insight into the limit dynamics of {qi} n >a- 

Proposition 3.9. Let Q n ~ n(Q), define 

q? +1 = I" ■ 7>,M5M) + (1 - nil', I" = lff»<a(0 ■ 

and consider in (3.9). Then, under the hypotheses of Proposition 3.8, as 

d — > oo: 

( g r,?i +1 ) A( g r,q? +1 ) ■ 

Proof. See Section 6. □ 

This proposition provides a simple description of the asymptotic behaviour of the 
one-transition dynamics of the marginal trajectories of HMC. As d — > oo, with 
probability a(l), the HMC particle moves under the correct Hamiltonian dynamics. 
However, the deviation from the true Hamiltonian dynamics, due to the energy 
errors accumulated from leapfrog integration of all d particles, gives rise to the 
alternative event of staying at the current position q n , with probability 1 — a(l). 



4. Optimal tuning of HMC 

In the previous section we addressed the question of how to scale the step-size 
in the leapfrog integration in terms of the dimension d, leading to Theorem 3.6. 
In this section we refine this analysis and study the choice of constant I in (3.7). 
Regardless of the metrics used to measure the efficiency of the algorithm, a good 
choice of I in (3.7) has to balance the amount of work needed to simulate a full 
T-leg (interval of length T) of the Hamiltonian dynamics and the probability of 
accepting the resulting proposal. Increasing I decreases the acceptance probability 
but also decreases the computational cost of each T-leg integration; decreasing I 
will yield the opposite effects, suggesting an optimal value of /. In this section 
we present an analysis that avoids the complex calculations typically associated 
with the estimation of mixing times of Markov chains, but still provides useful 
guidance regarding the choice of I. We provide two alternative ways of doing this, 
summarized in Theorems 4.1 and Theorem 4.2. 
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4.1. Asymptotically optimal acceptance probability. The number of leapfrog 
steps of length h needed to compute a proposal is obviously given by \T/h~\. Fur- 
thermore, at each step of the chain, it is necessary to evaluate a(X, Y) and sample 
P. Thus the computing time for a single proposal will be 

Td 1 / 4 

(4.1) C l!d :=\— j —)-d-C LF +d-C , 

for some constants Clf, Co that measure, for one particle, the leapfrog costs and 
the overheads. Let Ei id denote the expected computing time until the first accepted 
T-leg, in stationarity. If N denotes the number of proposals until (and including) 
the first to be accepted, then 

Ei d = C ld E[N] = Q d E[E[N|Q]l = Q d E\ , , 1 N , — T 1 . 

' 1 E[a(X,Y)\Q] J 

Here we have used the fact that, given the locations Q, the number of proposed 
T-legs follows a geometric distribution with probability of success E [a(X,Y) \Q\. 
Jensen's inequality yields 

(4 ' 2) El ' d ~ E[a(X,Y)} =: E '- d ' 

and, from (4.1) and Theorem 3.6, we conclude that: 

lim x E? d = . 

A sensible choice for I is that which minimizes the asymptotic cost E ; * d , that is: 
lopt = argmax eff(Z) ; eff(Z) := a(l) I . 

The value of l op t will in general depend on the specific target distribution under 
consideration. However, by expressing eff as a function of a = a(l), we may write 

(4.3) eff=(^). a .($- 1 (l-|))- 

and this equality makes it apparent that a (I opt) does not vary with the selected tar- 
get. Fig. 1 illustrates the mapping a > eff (a) ; different choices of target distribution 
only change the vertical scale. In summary, we have: 

Theorem 4.1. Under the hypotheses of Theorem 3.6 and as d — > oo, the measure 
of cost E* d defined in (4-2) is minimised for the choice l op t of I that leads to the 
value of a = a(l) that maximises (4-3). Rounded to 3 decimal places the, target 
independent, optimal value of the limit probability a is 

a(l op t) = 0.651 . 

The optimal value identified in the preceding theorem is based on the quantity 
E* d that underestimates the expected number of proposals. It may be assumed 
that the practical optimal average acceptance probability is in fact greater than or 
equal to 0.651. In the next subsection we use an alternative measure of efficiency: 
the expected squared jumping distance. Consideration of this alternative metric 
will also lead to the same asymptotically optimal acceptance probability of pre- 
cisely 0.651 as did the minimisation of Ej* d . This suggests that, as d — > oo, the 
consequences of the fact that E;*^ underestimates Ei td become negligible; proving 
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Figure 1. The efficiency function eff = cff(a). 



analytically such a conjecture seems hard given our current understanding of the 
limiting HMC dynamics. 

4.2. Squared jumping distance. We now consider the chain Q ,^ 1 , ... in sta- 
tionarity (i.e. Q° ~ H(Q)) and account for the computing cost Ci.d in (4.1) by 
introducing the continuous-time process Q N{ - 1 \ where {N(t); t > 0} denotes a Pois- 
son process of intensity = 1/Q <j. If Qd{t) : = denotes the projection of 
Q N W onto the first particle and 8 > is a parameter (the jumping time), we 
measure the efficiency of HMC algorithms by using the expected squared jump 
distance: 

SJV d {5) = E [ (q d (t + 8)- q d {t)f } . 

The following result shows that SJT>d{8) is indeed asymptotically maximized 
by maximizing a(l) I: 

Theorem 4.2. Under the hypotheses of Proposition 3.8: 

lim d 5 / 4 x SJV d - x o(i) I . 

Proof. See Section 6. □ 

4.3. Optimal acceptance probability in practice. As d — > oo, the computing 
time required for a proposal scales as 1/2 (see (4.1)) and the number of proposals 
that may be performed in a given amount of time scales as I. Inspection of (4.1) 
reveals however that selecting a big value of I gives the full benefit of a proportional 
increase of the number of proposals only asymptotically, and at the slow rate of 
0(d^ 1 / A ). On the other hand, the average acceptance probability converges at the 
faster rate 0(d~ 1 / 2 ) (this is an application of Stein's method). These considerations 
suggest that unless g? -1 / 4 is very small the algorithm will tend to benefit from 
average acceptance probabilities higher than 0.651. 



OPTIMAL TUNING OF THE HYBRID MONTE-CARLO ALGORITHM 



15 




91-0 01 900 000 




0S00 920 020 9 LOO OlO'O 900 000 




91. 01 90 00 




SIO 01 90'0 000 



Figure 2. Boxplots of Squared Errors (SEs) from Monte-Carlo 
averages of HMC. For 7 different selections of the leapfrog step- 
size h (corresponding to the different boxplots in each panel); the 
values of h are not shown. We ran HMC 120 times; every run 
was allowed a computing time of 30s. Each boxplot corresponds 
to the 120 SEs in estimating E [f(q) ], for a particular h and /(•). 
Written at the bottom of each boxplots is the median of the 120 
empirical average acceptance probabilities for the corresponding h. 
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Fig. 2 shows the results of a numerical study on HMC. The target distribution is 
a product of d = 10 5 standard Gaussian densities N(Q, 1). We have applied HMC 
with different choices of the step-size h and, in all cases, allowed the algorithm 
to run during a computational time t comp of 30 seconds. We used Monte-Carlo 
averages of the output 

N tcomp 

f=N t — E /(«?) 

to estimate, for different choices of /, the expectation E [/ ] = E[/(q)], q ~ 7V(0,1); 
here N taom denotes the number of T-legs carried out within the allowed time t comp . 
For each choice of h we ran the HMC algorithm 120 times. 

Each of the four panels in Fig. 2 corresponds to a different choice of /(•). In each 
of the panels, the various boxplots correspond to choices of h; at the bottom of 
each boxplot we have written the median of the 120 empirical average acceptance 
probabilities. The boxplots themselves use the 120 realizations of the squared 
distances: (/ — E [ / ]) 2 . The shape of the boxplots endorses the point made above, 
that the optimal acceptance probability for large (but finite) d is larger than the 
asymptotically optimal value of 0.651. 

5. Estimates for the leapfrog algorithm 

In this section we identify hypotheses on V under which Conditions 3.1, 3.2 and 
3.7 in Section 3 hold. 

We set / := -W (the 'force') and denote by f'(q) := (?)>/ (2) (?)>•• • thc 
successive Frechet derivatives of / at q. Thus, at a fixed q, f^ k '(q) is a multilinear 
operator from (R m ) fe+1 to R. For the rest of this section we will use the following 
assumptions on V: 

Assumptions 5.1. The function V : R m — > R satisfies: 

• (i)V eC 4 {R m ^R+). 

• (ii) f , /( 2 ) , /( 3 ) are uniformly bounded by a constant B. 

These assumptions imply that the potential V(q) can grow at most quadratically 
at infinity as |g| — > oo. (If thc growth of V is more than quadratic, then the leapfrog 
algorithm as applied with a constant value of h throughout thc phase space is in fact 
unstable whenever the initial condition is large.) The case where V takes negative 
values but is bounded from below can be reduced to the case V > by adding a 
suitable constant to V . In terms of the target measure this just involves changing 
the normalization constant and hence is irrelevant in thc HMC algorithm. 

5.1. Preliminaries. Differentiating (3.1) with respect to t, we find successively: 

Pit) = f\q(t))M- 1 V {t) , 

q(t) = M-\f(q(t)) , 

P(t) = f {2 \q{t)){M- l p{t) 1 M- 1 p{t)) + f'{ q {t))M- l f{q{t)) , 

q(t) = M- 1 / / (9W)M- 1 p(i) , 
"P(t) = fW(q(t))(M- 1 p(t),M- 1 p(t),M- 1 p(t))+ 

3/«(g(t))(M- 1 /(g(t)),M- 1 p(t)) + /'( g (i))M- 1 /'(e(*))M- 1 /(5W) , 
Tit) = M- 1 f^\q(t))(M~ 1 p(t),M~Mt))+M- 1 f'(q(t))M- 1 f(q(t)) . 
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In this section the letter K will denote a generic constant which may vary from one 
appearance to the next, but will depend only on B, T, ||M||, ||M _1 ||. From the 
above equations for the derivatives and using the assumptions on V, we obtain the 
following bounds: 

(5.1) 

\p(t)\<\f(q(t))\, \q(t)\<K\p(t)\, 
\P(t)\ < K\p(t)\ , \q(t)\<K\f(q(t))\, 
\p(t)\ < K(\p(t)\ 2 + \f(q(t))\) , \q(t)\ < K\p(t)\ , 

\-p(t)\ < K{\p{t)\ 3 + |p(t)||/(g(t))| + \f(q(t))\) , \"9(t)\ < K(\p(t)\ 2 + \.f(q(t))\) . 

5.2. Asymptotic expansion for the leapfrog solution. In previous sections we 
have used a subscript to denote the different particles comprising our state space. 
Here we consider leapfrog integration of a single particle and use the subscript 
to denote the time-level in this integration. The leapfrog scheme can then be 
compactly written as 

h 2 

(5.2) q n+1 = q n + hM-^ + — M" 1 /^ , 

(5.3) Pn+1 = p n + ~f(q n ) + ^f(q n + hM~ l p n + ^-M~\f(q n ) 

We define the truncation error in the usual way: 

h 2 



_ T (9) 



2 



q(t n+1 ) - (q{t n ) + hM-ipitn) + —M- l f{q{t n )j) , 

-r n W := p(Wi) - (p(t n ) + p(q n ) + + hM^p^) + yM" 1 f(q(t n ))) , 

where we have set t n = nh G [0, T]. Expanding (see [12]) we obtain: 
r^ = ^h a q(t n ) + h 4 0(fq(-)\U, 

ri p) = ~ h 3 P(t n ) + 0(\\T (-)lloo) +hO(r^) , 
where, for arbitrary function g: 

\\g(-)\\oo ■= sup \g(t)\ . 

0<t<T 

In view of these estimates, (1/6) h 3 q (t n ) and — (1/12) h 3 P (t n ) are the leading 
terms in the asymptotic expansion of the truncation error. Standard results (see, 
for instance, [13], Section II. 8) show that the numerical solution possesses an as- 
ymptotic expansion: 

q n = q(t n ) + h 2 v(t n ) + 0{h 3 ) , 

Pn = p{t n ) + h 2 u{t n ) + 0(h 3 ) , 

where functions u(-) and w(-) are the solutions, with initial condition u(0) = v (0) = 0, 
of the variational system 



(5.5) 



u(t)\ (0 M-\f(q(t))\ fu(t)\ (±P(t) 

v(t)J ~V o J [y{t)J + {-±q(t) 
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Remark 5.2. Notice here that u(-),v(-) depend on the initial conditions (q(0),p(0)) 
via (<}(■), p{-)) but this dependence is not reflected in the notation. One should keep 
in mind that most of the norms appearing in the sequel are functions of (g(0),p(0)). 

Applying Gronwall's lemma and using the estimates (5.1), we obtain the bound: 

(5-6) IKOIU + IKOIIoc < KQlp^Wl + ||/(g(.))||=o) 

and, by differentiating (5.5) with respect to t, expressing it, v in terms of u, v, and 
using (5.1) again, we obtain in turn: 

(5.7) ||n(-)||oc < i^(|b(-)l| 3 oc + lb(-)l|oc||/(?(0)l|oo + ||/(g(-))lloo) , 

(5.8) ||i3(0lloo<^(lb(-)ll» + ll/(9(-))ll=o)- 



5.3. Estimates for the global error. With the leading coefficients u, v of the 
global errors q n — q(t n ), p n — p(t n ) estimated in (5.6), our task now is to obtain an 
explicit bound for the constants implied in the 0(h 3 ) remainder in (5.4). To this 
end, we define the quantities 

Zn ■= q(tn) + h 2 v(t n ) , 

w„ :=p(t n ) + h 2 u(t n ) , 

and denote by Tn^* , t^* the residuals they generate when substituted in (5.2), 
(5.3) respectively, i.e., 

-t^> = z n+1 - z n - hM- 1 w n - yM" 1 / (z n ) , 

-t<*> = w n+1 -w n - -f(z n ) - -f(z n + hM- x w n + yA/- 1 /(2„)) • 
Since the leapfrog scheme is stable, we have 

(5.9) 0< max T (k„ - z n \ + \ Pn - w n \) < ^ o max T (|r^*| + \r^*\) 

with the constant C depending only on T and Lipschitz constant of the map 
(qn,Pn) !-> (<ln+i,Pn+i), which in turn depends on ||M _1 || and the bound for /'. 
The stability bound (5.9) is the basis of the proof of the following estimation of the 
global error: 

Proposition 5.3. If the potential V satisfies Assumptions 5.1, then for < t n < T, 

\Pn ~ (p(*») + h 2 u(t n ))\ < Kh^Wp^Wl + \\f(q(-))\\ 2 oo + 1) , 
\q n (q(t n ) + h\{t n ))\ < ^ 3 (|bO)llt + ll/(?(-))H 2 oo + 1) • 

Proof. Our task is reduced to estimating r^* , t^* . We only present the estimation 
for Tn^* , since the computations for ri^* are similar but simpler. 
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Indeed, after regrouping the terms, 

7-W* =KWi) -P(t n ) \f{q(t n )) - £/(g(t»+i)) + ^"P(t) 

" v ' 

h 

+ h 2 (u(t n+1 ) - u{t n ) - hf{q(t n ))v(t n ) - J^P(t)) 
, ' 



+ \ (/(<*(*»)) - /(*») + h 2 f'(q(t n ))v(t n ) 



h 

hf., , h 2 



-(f{z n + hM^Wn + — M-V(?(<»))) - - h 2 f(q(t n ))v(t n )) 



14 

+ \{f{zn + hM- l W n + yM-V(?(t n )) - /(Zn + /lM"V + yAf~ 1 /(z„))) 

" v ' 

h 

Now we estimate the above five terms separately. 
I\: We note that 

p(t n +i) - p(t n ) - ~f(q(t n )) - ~f(q(t n+ x)) = p(t n+ i) - p(t n ) - ^p{t n+ i) - ~p(t n ) . 
and by using the estimates in (5.1) it follows that 

\h\ < AT^bOHoc + |b(-)l|oo||/(?(-))Hoo + ||/(g(0)IU) • 

I2' Here we write I2 = h 2 (u(t n +\) — u(t n ) — hu(t n )) so that by (5.7) 
\h\ < Kh\\\ P (-)\\l + |b(-)IUI/(«Q)lloo + ||/(g(-))||=o). 

13 : This term is estimated, after Taylor expanding f(z n ) near f(q(t n )), by 

|/3|<^ 5 (|b(-)l|oo + ||/(<z(-))Hoo) 2 . 

14 : We rewrite this as 

^ (f{q(t n+1 ) + 7#> + h 2 v(t n ) + tfM^vitn)) - f(q(t n+1 )) - h 2 f (q(t n ))v(t n )) 
and Taylor expand around f(q(t n )) to derive the bound: 

I^l<^ 4 (lb(0lll, + Il/(3(0)IIL)- 

^5 : This term is easily estimated as: 

\h\ < KtfWvUU < Kh 5 (\\p(-)\\l + ||/(g(-))||oc) • 
Combining all the above estimates, we have the bound 

\ri p} *\<KhH\\ P (-)\\L + \\f(q(-))\\lo) ■ 
A similar analysis for t^* yields the bound 

Ir^i^^ObOllt + ll/CsO)!^)- 

The proof is completed by substituting the above estimates in (5.9). □ 
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We now use the estimates in Proposition 5.3 to derive the asymptotic expansion 
for the energy increment for the leapfrog scheme (cf. Condition 1). 

Proposition 5.4. Let potential V satisfy Assumptions 5.1. Then, for the leapfrog 
scheme, we get 

A(x, h) = h 2 a{x) + h 2 p(x, h) , 

with 

a(x) = (M- l p{T),u{T)) - (f(q(T)),v(T)) , 
Mx)\<K{\\p(.)\\l + \\f{q{.))\\ 2 00 + l) , 

\ P (x, h)\ < Kh{M-)\\l + \\f(q(-))\\l + 1), < h < 1 , 

where (q(-),p{-)) denotes the solution of (3.1) with initial data x = (q(0),p(0)) and 
u(-),v(-) are the solutions of the corresponding variational system given in (5.5) 
with u(0) = v(0) = 0. 

Proof. We only consider the case when T/h is an integer. The general case follows 
with minor adjustments. By Proposition 5.3, 



A(x, h) = H{^\x)) - H{x) = H{^ { P(x)) - H(<pt(x)) 

(Af _1 (ft 2 u(T) + ft 3 i?] 

+ v(q{T) + h 2 v{T) + h 3 R 2 ^j - V{q{T)) , 



= (M~ 1 p(T), h 2 u{T) + h 3 R{) + - (M-\h 2 u(T) + h 3 ^, (h 2 u{T) + /i 3 i?i)) 



where R\ , i?2 are remainders with 

|i?l| + |i? 2 |<A"(||p(-)l| 4 oo + ll/(?(-))H 2 oo + l) • 

By Taylor expanding V(-) around q(T) we obtain, 

A(x,h) = h 2 ({M- l p{T),u{T)) - (f(q(T)),v(T)))+p(x,h) , 

with 

Ipix^^KKh^p^lt + llfiqi-Ml + l) 
for < h < 1. From the bound (5.6) it follows that 

|a(»)|<^(||pOI|oo||«(0l|oo + ||/(«(0)l|oo|K-)||oo) 

<^(lb(-)llL + ll/(-)llL + i) 

and the theorem is proved. □ 

Our analysis is completed by estimating the quantities ||p(-)l|oo and ||<7( - )l|oo, 
that feature in the preceding theorems, in terms of the initial data (q(0),p(0)). 
We obtain these estimates for two families of potentials which include most of 
the interesting/useful target distributions. The corresponding estimates for other 
potentials may be obtained using similar methods. 

Proposition 5.5. Let potential V satisfy Assumptions 5.1. Lf V satisfies, in ad- 
dition, either of the following conditions: 

(i) / is bounded and 



(5.10) / \V(q)\ 8 e- V( ^dq < 



oo 
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(ii) there exist constants C\,Ci > and < 7 < 1 such that for all \q\ > C%, 
we have V(q) > Ci |o , |' Y / 
then Conditions 3.1, 3.2 and 3.7 all hold. 

Proof. We only present the treatment of Conditions 3.1 and 3.2. The derivation of 
Condition 3.7 is similar and simpler. 

From Proposition 5.4 we observe that function D{x) in Condition 3.2 may be 
taken to be 

C(x-)=A'(|| P (-)||^ + ||/(9(-))l| 4 oo + l)- 
Thus, to prove integrability of D(-) we need to estimate ||p(-)||oo and ||/(<z(-))lloo- 
Estimating ||p(-)l|oo is easier. Indeed, by conservation of energy, 

i<p(f), M-^t)) < ^(p(0),M-yo)) + y(g(0)) , 

which implies 

(5.11) \p(t)\ le <K(\p(0)\ le + \V(q(0))\ s ) . 

Now, we prove integrability of D(-) under each of the two stated hypothesis. 

Under hypothesis (i): Suppose / is bounded. In this case we obtain that |.D(x)| < 
^(ll-P(')lloo + l)i therefore it is enough to estimate ||p(-)l|oo- Since the Gaussian 
distribution has all moments, integrability of D follows from (5.10) and (5.11). 

Under hypothesis (ii): Using the stated hypothesis on V(q) we obtain 

C.WW < V(q(t)) < ±(p(0),M- l p(0))+V(q(0)) , 
which implies that: 

\q(t)\<K(\p(0)\* + \V{q(0)W 

By Assumptions 5.1(h), \f(q(t))\ < K(l + \q(t)\) and arguing as above and using 
the bound (5.11), integrability of D follows if we show that 

\V(q)\ S e- V ^dq < 00 , S = max(8, -) . 

7 

Since \V(q)\ < A(l + |g| 2 ), 

\V{q)\ S e~ v ^dq < K [ (1 + \q\ 2S ) e~ B ^ ' dq < 00 

and we are done. □ 

6. Proofs of Probabilistic Results 
Proof of Lemma 3.3. The volume preservation property of ip^ (•) implies that the 

(T) ~ ^ 

associated Jacobian is unit. Thus, setting x = (%p h ) (y) we get: 



<p(A(x, h)) e- H(x) dx= I f{H{ip ( p (x)) - H(x)) e~ H{x) dx 



Following the dchnition of time reversibility in (2.2), we have: 



22 



A. BESKOS, N.S. PILLAI, G.O.ROBERTS, J. M. SANZ-SERNA, AND A.M.STUART 



for the symmetry operator S such that S(q,p) = (q, — p). Using now the volume 
prescnving transformation y = Sz and continuing from above, we get: 

^(A(x,h))e- H ^ dx 

<p(H(Sz) - H((^)-\Sz))) e-^r)- 1 ^))^ 
<p(H(Sz) - H(S^P(z))) e - H W*™W»dz 

tp(H(z) - H{^\z))) e- H{ ^ T){z)) dz, 

where in the last equation we have used the identity H(Sz) = H(z). □ 

Proof of Proposition 3.4- We will first find the limit of a 2 (h)/h 4 . Conditions 3.1 
and 3.2 imply that: 

A 2 (x h) 

i-^— i- = a 2 (x) + p 2 {x, h) + 2p(x, h)a(x) < D(x) 

and since, for fixed x, A 2 (x,h)/h 4 —> a 2 (x), the dominated convergence theorem 
shows: 

lim 4^ = / a 2 (x)e- H ^dx = E . 
h^o h 4 J R2m 

Now, (3.6) implies that: 

(6.1) lim 4P = , 

v ' h^o h 4 

and the required limit for <r 2 (h)/h 4 follows directly. Then, from (3.5) we obtain 
2p,(h)-u 2 (K) _ 
h 4 ~ 

AQc, h) [ exp(- A(x, h))-l + A(x, h)] _ H[x) u 2 (h) 
h 2 h 2 + h 4 • 

Since for any fixed x, Conditions 3.1 and 3.2 imply that A(x, h) — > as h — > and 
A 2 (x, h) = 0(h 4 ), we have the pointwisc limit 

lim cxp(-A(a;,fe)) -1 + A(x,h) = 
fs-KI h 2 

Using the inequality |u| \e u — 1 — u\ < |u| 2 (e 11 + 2), we deduce that for all sufficiently 
small h, 

\A(x,h)\ \exp(-A(x,h))\ _ H(x) 
h 2 h 2 6 



dx 



< 



^^eM-Mx,h))e-^dx + 2 [ ^ ^ dx 



< 3 / D(x) e~ H ^dx < oo, 



where the last line follows from applying Lemma 3.3 with (f(x) = x 2 and Condition 
3.2. So, the dominated convergence theorem yields 

h->0 h 4 
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This completes the proof of the proposition. □ 

Proof of Theorem 3.6. We continue from (3.8). In view of the scaling h = I ■ d^ 1 ' 4 
we obtain, after using Proposition 3.4: 

E[R d ] = -d-fi(h) ->■ ^ 

and 

Var[i? d ] = d- a 2 (h) -> Z 4 £ . 
The Lindeberg condition is easily seen to hold and therefore: 

Rd-^Roo ■= N(-!^J 4 Z) . 
From the boundedncss of u > 1 A e" we may write: 

E[a(X,Y)} ->E[lAe flo °] , 
where the last expectation can be found analytically (see e.g. [22]) to be: 

E [ 1 A e R ~ } = 2$(-1 2 Vy,/2) . 
This completes the proof. □ 

Proof of Proposition 3.8. For simplicity, we will write just q n , q n+1 and p n instead 
of g™, Pi respectively. Using (3.9), we get: 

(q» +1 - q n f = I n (V q ^P(q n ,p n ) - q n f ■ 

We define: 

d 

(6.2) a~(X n ,Y n ) :=lAcxp{-^A(x",/i)}; I n ~ :=I[r«<o-(x»,r») i 

i=2 

and set 

C = / n -(^vfV>")-<7") 2 ■ 

Using the Lipschitz continuity of u h-> ly < 1Ae u and the Cauchy- Schwartz inequal- 
ity we get: 

EK^-g^-a < |A(a !l) ft)| ia |(7>,Vr ) (? ,, .l' n )-9 n ) 2 k 
Now, Conditions 3.1 and 3.2 imply that 

\A( Xl ,h)\ L2 =0(h 2 ) . 

Also, from Condition 3.7 and the stated hypothesis on the density exp(— V), q n 
and Vqipfr (q n ,p n ) have bounded fourth moments uniformly in h, so: 

\(V q ^P(q n ,p n )~q n ) 2 \L,<C , 
for some constant C > 0. The last two statements imply that: 

(6.3) E\(q n+1 - q n ) 2 - C\ =0{h 2 ) . 
Exploiting the independence between l n ~ and the first particle: 

E[6,]=E[a-(X,y)]xE[ (T q ^P(q n ,p n ) - q n f } — > 

a(0-E[(7>T(<Ap")-e? n ) 2 ] , 
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where, for the first factor we used its limit from Theorem 3.6; for the second factor 
the limit is a consequence by Condition 3 and the dominated convergence theorem. 
Equation (6.3) completes the proof. □ 

Proof of Proposition 3.9. Fix some € M m . We define a~(X n , Y n ) and I n ~ as in 
(6.2). For simplicity, we will write just q n , q n+1 , q" +1 and p n instead of g™, q" +1 , 
q" +1 and p" respectively. 
Wc set 

g n+l = jn- . Vg<pT ( q » ip n) + (l _ g » . 

Adding and subtracting I n ■ V q (fT(q n ,P n )) yields: 

\q n+1 - 9 n+1 \ < \V q (^P(q n ,p n )) ' V q ^ T (q n ,p n ))\ 
(6-4) - I n \{\V q (<p T {q\p n ))\ + \q n \) ■ 

Using the Lipschitz continuity (with constant 1) of u t— > Ijj < i Ae xp(u) : 

(6.5) \r--r\< \A( Xu h)\ . 

Now, Condition 3.7 implies that the first term on the right-hand side of (6.4) 
vanishes w.p.l and Condition 3.f implies (via (6.5)) that also the second term 
vanishes w.p.l. Therefore, as d — > oo: 

q n+l _ g n+l ^ Qj a s _ _ 

Theorem 3.6 immediately implies that I n — > l n , thus: 

g n+l q n+l 

From these two limits, we have q n+1 — > q™ +1 , and this completes the proof. □ 

Proof of Theorem 4-2. To simplify the notation we again drop the subscript 1. Con- 
ditionally on the trajectory q° , q , . . . we get: 

f 0, w.p. f- X d 5 + 0((X d 5) 2 ) , 

(q(t + 6)- q{t)f = { (q N ^ + 1 - q N ^) 2 , w.p. X d S + 0((X d S) 2 ) , 

{ (q^+l+i - q N ^f , j > 1 , w.p. 0((X d Sy + 1 ) . 

Therefore, 

SJV d = E [ (q N ^ +1 - q N ^f ] (X d S + 0((X d 6) 2 )) 

(6.6) +J2^[(q N{t)+1+3 - q N{t) ) 2 ]o((x d 5y +1 ) . 

3>1 



,j+l 



Note now that: 

E[(g"M+i+J -q N W) 2 ] < (J2k m+k -9 JV(t)+fc - 1 |L 2 

fc=i 

= (j + l) 2 E[(q n + 1 -q n ) 2 } , 
since we have assumed stationarity. From (4.f ): 

A d = d- 5 /4_i_ +0(d - 6 / 4) _ 

and, from Proposition 3.8, E [ (q n+1 - q n ) 2 } = <D(l). Therefore, 
<£>' A X Y, E [ (q N{t)+1+1 - q N{t} ) 2 } 0((X d Sy +1 ) 

3>1 
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is of the same order in d as 



^.^x^ + lfOtAf 1 ), 



thus: 



g* (t >) a ]0((W +1 ) = 0(A rf ) . 



Using this result, and continuing from (6.6), Proposition 3.8 provides the required 



The HMC methodology provides a promising framework for the study of a num- 
ber of sampling problems, especially in high dimensions. There are a number of 
directions in which the research direction taken in this paper could be developed 
further. We list some of them. 

• The overall optimization involves tuning three free parameters (l,T,M), 
and since M is a matrix, the number of parameters to be optimized over, is 
even more in general. In this paper, we have fixed M and T, and focussed 
on optimizing the HMC algorithm over choice of step-size h. The natural 
next step would be to study the algorithm for various choices of the mass 
matrix M and the integration time T. 

• We have concentrated on explicit integration by the leapfrog method. For 
measures which have density with respect to a Gaussian measure (in the 
limit d — > oo ) it may be of interest to use semi-implicit integrators. This 
idea has been developed for the MALA algorithm (see [4] and the references 
therein) and could also be developed for HMC methods. It has the potential 
of leading to methods which explore state space in 0(1) steps. 

• The issue of irreducibility for the transition kernel of HMC is subtle, and 
requires further investigation, as certain exceptional cases can lead to non- 
ergodic behaviour (see [5, 27] and the references therein). 

• There is evidence that the limiting properties of MALA for high-dimensional 
target densities do not appear to depend critically on the tail behaviour of 
the target (see [23]). However in the present paper for HMC, we have con- 
sidered densities that are no lighter than Gaussian at innity. It would thus 
be interesting to extend the work to light-tailed densities. This links natu- 
rally to the question of using variable step size integration for HMC since 
light tailed densities will lead to superlinear vector fields at infinity in (2.1). 

• There is interesting recent computational work [9] concerning exploration 
of state space by means of nonseparable Hamiltonian dynamics; this work 
opens up several theoretical research directions. 

• We have shown how to scale the HMC method to obtain 0(1) acceptance 
probabilities as the dimension of the target product measure grows. We 
have also shown how to minimize a reasonable measure of computational 
cost, defined as the work needed to make an 0(1) move in state space. 
However, in contrast to similar work for KWM and MALA ([22, 23]) we 
have not completely identified the limiting Markov process which arises in 
the infinite dimensional limit. This remains an interesting and technically 
demanding challenge. 



statement. 



□ 



7. Conclusions 
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